Determining an effective short term COVID-19 prediction model in ASEAN countries

The challenge of accurately short-term forecasting demand is due to model selection and the nature of data trends. In this study, the prediction model was determined based on data patterns (trend data without seasonality) and the accuracy of prediction measurement. The cumulative number of COVID-19 affected people in some ASEAN countries had been collected from the Worldometers database. Three models [Holt’s method, Wright’s modified Holt’s method, and unreplicated linear functional relationship model (ULFR)] had been utilized to identify an efficient model for short-time prediction. Moreover, different smoothing parameters had been tested to find the best combination of the smoothing parameter. Nevertheless, using the day-to-day reported cumulative case data and 3-days and 7-days in advance forecasts of cumulative data. As there was no missing data, Holt’s method and Wright’s modified Holt’s method showed the same result. The text-only result corresponds to the consequences of the models discussed here, where the smoothing parameters (SP) were roughly estimated as a function of forecasting the number of affected people due to COVID-19. Additionally, the different combinations of SP showed diverse, accurate prediction results depending on data volume. Only 1-day forecasting illustrated the most efficient prediction days (1 day, 3 days, 7 days), which was validated by the Nash–Sutcliffe efficiency (NSE) model. The study also validated that ULFR was an efficient forecasting model for the efficient model identifying. Moreover, as a substitute for the traditional R-squared, the study applied NSE and R-squared (ULFR) for model selection. Finally, the result depicted that the prediction ability of ULFR was superior to Holt’s when it is compared to the actual data.

The widespread coronavirus disease  began in China (Hubei Province) in December 2019. Multiple phases of the pattern of the COVID-19 disease continue to be the source of infections in numerous countries, frightening to turn out to be a long-term epidemic. The main reason for spreading this virus is that it was not detected early. Researchers have already begun investigating how the pandemic affects global macroeconomic trends 1 . However, essential to predict how much the virus is spreading rather than measuring global financial loss.
A former study has shown that biological growth models, especially the sub-epidemic growth model, can reveal the empirical shapes of past epidemics that support producing short-term forecasts of an epidemic model in real-time. These epidemic models are beneficial when the collected data are limited for bringing any ideal information 2,3 . In addition, the short period of real-time forecasts created from such models could support preparing to allocate the necessary resources, which are crucial to bringing the epidemic under control.
It is known that each communicable disease epidemic shows specific patterns. Seasonal changes and an adaptation of the virus over time are critical reasons for outbreaks showing different patterns in a country or region. Showing different patterns is typically happening at a different scale level concerning time. Generally, pandemic patterns occur in non-linear shapes. Therefore, this characteristic inspires us to create a model to identify such non-linear lively fluctuations 4 . Moreover, the transmission of such transmittable diseases could be defined by analysing the pattern of these non-linear systems.
The main objective of this research is to determine an effective short-term prediction model by utilizing the cumulative number of COVID-19 infected people in four ASEAN (Association of Southeast Asian Nations) countries, i.e., Thailand, Philippines, Singapore, and Indonesia. To the best of our knowledge, no literature has analysed ASEAN countries' to predict short term COVID-19 cases, mainly predicting 1 day. Typically, short-term reliable prediction can assist decision-makers in managing the pandemic efficiently in real time and mitigate www.nature.com/scientificreports/ any upcoming severe risks. Besides, the study focuses on the four countries for the following motives. In the first place, ASEAN countries, especially these four countries, contribute to a vast global economy. Secondly, ASEAN countries' COVID-19 data is available in Worldometers. Therefore, it is crucial to understand how to predict the COVID-19 spreading rate at some ASEAN countries and take effective measures to control the pandemic. Hence, this paper principally uses three dynamic models to generate 1-day ahead or multiple day's ahead forecasts of the cumulative reported cases of COVID-19 by addressing any seasonal variations. Firstly, the prediction model, Holt's, and prediction parameters are selected. Secondly, it employs Wright modified Holt's method for forecasting irregularity in time spacing. Thirdly, an unreplicated linear functional relationship model (ULFR) is applied to find a more justified forecasting value of dependent and independent variables. This ULFR model is a newly used technique in such a type of prediction analysis. Last but not least, the prediction Holt's model was selected based on data patterns (trend data without seasonality) and the accuracy of prediction measurement with different statistical methods. Particularly, instead of using only the traditional R-squared, this study also used the nash-sutcliffe efficiency (NSE), model efficiency factor (MEF) and the R-squared value of ULFR for model selection.

Literature review
Since the outbreak in Wuhan, several worldwide scholars and researchers have reported estimations and predictions for the COVID-19 epidemic in journal publications or on websites [5][6][7] . Among the studies, many modelling results have shown a wide range of variations depending on data availability 8 : estimated basic reproduction number varies from 2 to 6, peak time estimated from mid-February to late March, and the total number of infected people ranges from 50,000 to millions. An important question is now, even though predictions are made using transmission models based on either the Susceptible-Infectious-Removed (SIR) or Susceptible-Exposed-Infectious-Removed (SEIR) framework, why is there such a wide variation in model predictions? 9 . The literature review shows that the primary reason for such variation is selecting a small number of data or selecting parameters without justification 10 .
Recently applied various models for measuring spreading and forecasting COVID-19 such as Ace-Mod Australian Census-based Epidemic Model (Ace-Mod), SIDR, Fuzzy Clustering, SEIR and DASS-21 [11][12][13] . Particularly, Roosa et al. 13 generate forecasts in China using the Richards growth model, a sub-epidemic wave model, and a generalized logistic growth model. Following models have been previously used to forecast outbreaks due to different infectious diseases. Forecasts finding from each model suggest the outbreaks may be nearing extinction in Guangdong and Zhejiang.
Fokas et al. 14 applied a computational approach to computing the impact on the number of deaths globally and Paul et al. 15 predict the disease burden with special emphasis on south Asian countries (India, Bangladesh, and Pakistan) by the SEIR method. Hasan and Siddik 16 Examine the correlation between daily and total COVID-19 cases in Bangladesh by linear relationship model. Petropoulos and Makridakis 17 applied exponential smoothing models to predict the continuation of the COVID-19 by analyzing confirmed cases, deaths, and recoveries of several people. Moreover, Md Hasinur Rahaman Khan and Ahmed Hossain 18 used the Infection Trajectory-Pathway Strategy (ITPS) model to analyze the COVID-19 outbreak situations and predict infections and deaths case from temporal data of confirmed and death cases in Bangladesh. Besides, Fanelli and Piazza 11 applied the SIDR model to analyze and forecast COVID-19 spreading in China, Italy, France with a concentration in the variables (Susceptible, Infected, Recovered, Dead, Scheme). Besides, Zhou et al. 19 focused on deep neural network(DNN) and convolutional neural network (CNN) model for behavioural analysis based on heterogeneous health data generated in social media. Furthermore, Mahmoudi et al. 20 studied the relationship between the spread of Covid-19 and population size by applying the Fuzzy clustering model in USA, Spain, Italy, Germany, UK, France, and Iran. Even though a few researchers used regression models and Genetic programming for short-term prediction of COVD-19 cases in the different regions due to the small number of data or parameter selection problems, many of those models' outcomes have shown a wide range of dissimilarities 10,17 . Notably, some studies [13][14][15][16][17]21 have used various mathematical models to determine the spread of the disease, predict the number of incidences, deal with healthcare missing data, health care facilities in tackling COVID-19 spread. In addition, some researcher uses the SEIR model, Genetic Programming and Regression model for short -term prediction [22][23][24] . Moreover, a dynamic model, a generalized logistic growth model, the Richards growth model, and a sub-epidemic wave model were used to generate 5-day and 10-day ahead forecasts of the cumulative reported cases in the provinces of Guangdong and Zhejiang, China 13 . Consequently, selecting the appropriate model and parameter value is key for the forecasting model. Besides, many models are available in the literature to model infectious diseases. A few models have been used primarily for the countries where the number of cases is very high, like China, Italy, Spain, the UK, Germany and the USA. Still, no literature has analysed ASEAN countries' to predict short term COVID-19 cases mainly predicting for 1 day.

Methodology
In forecasting, the popular methods are the average approach, Naïve approach, Drift method, Seasonal naïve approach, Time series methods, Econometric forecasting methods, artificial neural networks. In economics, the widely applied methods fall in Time series methods and Econometric forecasting methods. However, in health science, rarely forecasting models are applied. Unfortunately, COVID-19 teaches us the importance of prediction results to control the spread of the disease. For the model selection, the study first applied the rule of thumb. Then after analyzing data, three models (Holt's method, Wright's modified Holt's method, and unreplicated linear functional relationship model (ULFR)) have been utilized to identify an efficient model for short-time prediction. Afterwards, this study has tested different smoothing parameter by MAPE, MAD, MSE and RMSE to select effective smoothing parameters. Finally, the model is validated by the NSE, MEF, traditional R-squared, www.nature.com/scientificreports/ and R-squared (ULFR). Therefore, the present study relies on the three mentioned models that are first applied in health science to forecast the number of COVID-19 incidences in four ASEAN countries based on the actual historical data of August 20, 2020, to September 16, 2020.
Holt's method. To build Holt's method, first, the exponential smoothing technique was projected in the late 1950s 25 . In addition, the exponential smoothing technique has motivated few of the most practical forecasting approaches. The exponential smoothing method to produce forecasts is defined as weighted averages of former observations. In other words, Holt's is defined as a linear-exponential smoothing. This smoothing model is well known for forecasting data with trends. This model consists of three individual equations that are applied together to create a final forecast result. Among the three equations, the first equation is a smoothing equation that replaces the last period's trend value with the last smoothed value. Besides, the second equation is called the trend equation. The second equation consists of the changes between the last two smoothed values. The final equation consists of level and trend values to find the forecast value. Two parameters are used in Holt's method are called smoothing parameters. One parameter is used for overall smoothing, and the other one is used for trend smoothing. Therefore, another name of holt's method is the double exponential smoothing or trend exponential smoothing model 26 . It can be expressed by the following three equations: Herein the smoothing constant and variables are defined below: Y t Estimate of the level of the series at time t, Z t Estimate of the trend (slope) of the series at time t, Smoothing parameter for the trend, X t Estimate of the period t base level from the current period, Y t−1 + Z t−1 Estimate of the period t base level based on previous data. To calculate the optimal forecasting of the Eq. (3), the following optimization technique to minimizing the squared error over all data points: Min To calculate Z t the following two quantities are taken as a weighted average: 1. Estimation of a trend from the current period from the upsurge in the smoothed trend between the periods (t-1) and t. 2. Z t−1 , which is the previous estimate of the trend.
To start Holt's method, Y 0 is an initial estimate of the level and another an initial estimate is called Z 0 which is used of the trend. Here, Z 0 equals to previous year's average increase in the time series and Y 0 equals to last observation.
Wright's modified Holt's method. Wright 27 introduced a modification of Holt's method for the data with irregularity in time spacing. However, modified Holt's method has some unique characteristics, i.e. it has greater computational efficiency and flexibility in terms of having more smoothing constants. Besides, the modified model has a better performance record with empirical data with missing data or zero data.
The notation and forecasting equation of Wright's modified Holt's method is illustrated below: For the Eqs.
One period ahead forecast of the next infected number: f n+ 1 = l n + m n ( t n − t n−1 )  www.nature.com/scientificreports/ between the variables becomes a fuzzy relationship due to unusual fluctuations with defined variables. Moreover, as 28 mentioned in an article, it is unlikely to measure precisely independent variables in all circumstances. The relationship model for the two variables is introduced following way by considering the above issues. Assume that X and Y are two linearly related unobservable variables. In the Eq. (7) X and Y are defined as a linearly independent variable and target variable respectively.
The parameters value β a , β f of the Eq. (7) can be found by least square loss function (minimizing the squared error) over all data points : Min The linear functional relationship model assigns dependent and independent variables by assuming that both the variables are subject to errors. Let, two variables X i and Y i which correspond to random variables x i and y i that are observed with errors, d i and e i respectively, such that, Chang et al. 29 termed the defined Eqs. (7) and (8) as ULFR model. In this model, it is assumed that errors d i and e i are mutually independent and normally distributed random variables.
When the ratio of the error variance is known, that is The functional form of R-squared is as follows: www.nature.com/scientificreports/ Nash-Sutcliffe efficiency. To avoid the shortfall of R-squared, the study also used the NSE factor presented by 31 to determine the efficiency of the models: This formula can be applied for linear regression and original data on any model. Besides, NSE values can be negative for the non-linear models. The value of NSE could be between −∞ and 1. Naturally, analysts seek an NSE value close to 1 for the best performance of a model. The negative result of NSE specifies an improper model efficiency. The NSE has an excellent reputation for analysing the efficiency of any model, especially the frequently applied model in hydrology 32,33 . Besides, this study is also introduced the model efficiency factor (MEF) to give a comprehensive explanation of NSE from a different perspective, i.e.
Any smaller value of MEF shows the validation of the model. Thus, the values of MEF ranges between 0 and 1, where the smallest value of the MEF shows excellent models' performance. However, the value of MFE zero indicates an error-free model, which is not convincing result 34 . Dataset. The data in this research were collected from Worldometers info (https:// www. world omete rs. info/ coron avirus/), provided with confirmed cases in ASEAN countries from 20th August to 16th September 2020. There are several reasons behind collecting data from Worldometer. First, there is no reliable free data source for ASEAN countries except Worldometer. Secondly, Worldometer is accredited as the oldest and most reliable data source by the American Library Association (ALA), Johns Hopkins CSSE, Financial Times, The New York Times and Government of UK. Finally, Worldometer is cited as a source of data in thousands of renowned journal articles. The data set in this study included all the ASEAN countries.
Since, trend data are the pre-requisite for better predicting using Holt's method 26 , therefore some ASEAN countries, Laos, Vietnam, Brunei, Cambodia, Myanmar were excluded from the study were excluded from this study. Besides, we were unable to get the consistent trend pattern of COVID-19 infected cases for these countries. Even though Wright's modified Holt's method can deal with zero values 27 , the number of zero values was almost all for these countries.. The summary statistics of the data set is given in Table 1. In the table, it is seen that total number data is 28 for all the country. However, standard deviation and maximum and minimum value has huge difference because each country has different number of population.

Result and discussion
Selection of smoothing constant for the prediction model. For obtaining the significant result from Holt's model, it is imperative to determine the best combination of smoothing constant. Therefore, various models such as Ace-Mod (Australian Census-based Epidemic Model) 35 , neural network-based models 36 and others have been employed to access the situation. However, though these models are significant, they are not appropriate for trend and seasonal data 25 . Table 2 shows the best combinations of smoothing constants for different days in four ASEAN countries. These findings are validated by MAPE (mean absolute percentage error) value as it is mentioned that the MAPE is one of the most popular measures of forecast accuracy 37 .

Model validation.
Several studies on validating the forecasted model of COVID-19 transmission are based on R-squared value only. Nevertheless, our research examined the accuracy and validated our prediction models using R-squared, R-squared (ULFR), NSE, and MEF values. From Table 3, it is noticed that the ULFR model predict the total cases (1-day ahead) with better accuracy compared with Holt's. In addition, the ULFR with a www.nature.com/scientificreports/   www.nature.com/scientificreports/ 1-day ahead prediction had the highest accuracy among the three different days' predictions. Furthermore, it is noticed that Indonesia and Singapore have the highest R-squared and NSE value, which approaches unity. Besides, Indonesia and Singapore show the lowest MEF, which approaches zero. Therefore, to find the forecasting accuracy of the proposed model, this study has been verified by comparing two data sets, i.e., the forecasted data and the reported data. These data sets are applied in the following statistical model to validate the model and maximize the forecasting accuracy. All the statistical evidence shows that ULFR is the best prediction model than Holt's model. Moreover, in the result and discussion section the modified Holt's method (Wright's model) is not included because Holt's and Wright's model showing the same result and it is accepted because there was no missing observations in our studied data 38 .

Forecasting by Holt's Method and ULFR. In this section, Holt's and ULFR methods have been applied
to predict the number of confirmed cases in ASEAN countries, particularly in Thailand, Philippines, Singapore and Indonesia. The goal is to find accurate predictions, which is probably impossible, but to illustrate some general qualitative behaviours which may be observed 39 . Therefore, Figs. 1 and 2 show the cumulative cases of the forecast for 1, 3 and 7 days ahead by Holt's and ULFR methods respectively. From Fig. 1, it is seen that Holt's method shows better results for the Philippines and Indonesia for 1, 3 and 7 days. However, the forecasting result is not up to the mark for Thailand and Singapore. On the other hand, ULFR in Fig. 2 shows the descent prediction result for all four countries. Therefore, ULFR gives a better prediction result than Holt's. The fluctuating case may partially explain that long term prediction is less significant for both models even though ULFR shows significant results than Holt's. Additionally, it is seen clearly form the Figs. 3, 4 and 5 that short term (1 day) prediction is more accurate than long term prediction. In this case, ULFR shows a perfect prediction result than   40 has mentioned that COVID-19 is increasing globally at a rate of 3% to 5% daily. Sharif et al. 10 commented in their forecasting analysis that the cumulative number of infected cases would be raised to around 500,000 by 2021 in Dhaka division, Bangladesh. However, these studies did not show any justified prediction comparison. But, in our study we have compared original data and predicted data through graphical presentation. After comparing Figs. 3, 4 and 5 it is clearly visible that day 1 prediction result shows highest accuracy in prediction. Besides, statistical analysis in the Table 3 depicts that highest number of coefficient of determination values of Holt's and ULFR model for 1 day prediction. It is also observed that Fig. 4 depicts better accuracy than Fig. 5 which means 3 days prediction is better than 7 days prediction. However, in a study on deep learning-based forecasting model for COVID-19 outbreak in Saudi Arabia, Brazil, India, Saudi Arabia, South Africa, Spain, and USA 41 found that different countries has different trends. But, that study did not examine all the counties' prediction result with justification. They just showed their accurate result justification by using conventional RMSE, MSE and R 2 value. However, in our study we have analysed prediction result validation by coefficient of determination, NSE and MFE for both Holt's and ULFR result which are presented in the Table 3.
Moreover, Roosa et al. 13 analyzed daily reported cumulative data of COVID-19 cases of China (Guangdong and Zhejiang) generated from13 to 23 February 2020 to forecast only 10-day ahead cumulative case by using a generalized logistic growth model, Richards growth model, and a sub-epidemic wave model. Even though the study mentioned that short term prediction, they did not define short term is how much short. In our study we have used deferent regional data and compared the data for different days. In Figs. 1, 2 In another study the neuro-fuzzy system was used by Al-qaness et al. 42 by incorporation with marine predators algorithm to predict the COVID-19 cases in Italy, South Korea, Iran, and USA. A high coefficient of determination was obtained for all the forecasted nations; it was 96.48%, 96.96%, 98.74%, and 98.59% for South Korea, Iran, USA, and Italy respectively. On the other hand, in our study a very high coefficient of determination was obtained for day one and day three ahead prediction for all the counties; it was around 99% for all the four   www.nature.com/scientificreports/ countries (Thailand, Philippines, Singapore and Indonesia). However, worst prediction result was seen for 7 days ahead which is around 67% for Thailand.

Conclusion
To sum up, this is the first study to predict the number of infected COVID-19 cases in some ASEAN countries using Holt's and ULFR approaches. Based on our findings, for the short term (1-day prediction), Holt's and ULFR show better results. However, ULFR shows a significant result for 3 and 7 days too. This study's outcome could help ASEAN countries' governments observe the current COVID-19 condition and apply the forecast result to avoid further transmissions. We recommend conducting a follow-up study as many external factors estimate 100% forecast accuracy even though it is impossible to find 100% accurate results. For example, the daily COVID-19 infected cases reported by the government might be lower than the actual number of cases due to delays in publishing test results. Besides, some people would be immune before even screening the COVID-19 test.
There are a few limitations in this study that this study did not include more than 7 days' prediction due to a small data range. Besides, this study could not have all ASEAN countries due to a lack of balance data. Moreover, this study concentrated only on one factor: the number of confirmed cases. Thus, for future study, the next step would be to apply the Holt's and ULFR model to forecast optimal parameters such as the total confirmed cases, total deaths, and total recovered cases for more than 7 days ahead.